72 research outputs found

    Regularized Newton Method with Global $O(1/k^2)$ Convergence

    We present a Newton-type method that converges fast from any initialization and for arbitrary convex objectives with Lipschitz Hessians. We achieve this by merging the ideas of cubic regularization with a certain adaptive Levenberg--Marquardt penalty. In particular, we show that the iterates given by $x^{k+1} = x^k - \bigl(\nabla^2 f(x^k) + \sqrt{H\|\nabla f(x^k)\|}\,\mathbf{I}\bigr)^{-1}\nabla f(x^k)$, where $H>0$ is a constant, converge globally with an $\mathcal{O}(\frac{1}{k^2})$ rate. Our method is the first variant of Newton's method that has both cheap iterations and provably fast global convergence. Moreover, we prove that, locally, our method converges superlinearly when the objective is strongly convex. To boost the method's performance, we present a line search procedure that does not need hyperparameters and is provably efficient. Comment: 21 pages, 2 figures
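
    To make the update rule concrete, here is a minimal NumPy sketch of the iteration from the abstract on a toy convex quadratic; the value H = 1.0 and the test problem are illustrative choices, not values from the paper.

        import numpy as np

        def reg_newton_step(grad, hess, x, H=1.0):
            # One step of the update from the abstract:
            # x+ = x - (hess(x) + sqrt(H * ||grad(x)||) * I)^{-1} grad(x)
            g = grad(x)
            reg = np.sqrt(H * np.linalg.norm(g))
            A = hess(x) + reg * np.eye(x.size)
            return x - np.linalg.solve(A, g)

        # Toy problem: f(x) = 0.5 x^T Q x, an ill-conditioned convex quadratic.
        Q = np.diag([1.0, 100.0])
        grad = lambda x: Q @ x
        hess = lambda x: Q
        x = np.array([5.0, -3.0])
        for k in range(30):
            x = reg_newton_step(grad, hess, x)
        print(x)  # approaches the minimizer at the origin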

    Adaptive Proximal Gradient Method for Convex Optimization

    In this paper, we explore two fundamental first-order algorithms in convex optimization, namely, gradient descent (GD) and the proximal gradient method (ProxGD). Our focus is on making these algorithms entirely adaptive by leveraging local curvature information of smooth functions. We propose adaptive versions of GD and ProxGD that are based on observed gradient differences and, thus, have no added computational costs. Moreover, we prove convergence of our methods assuming only local Lipschitzness of the gradient. In addition, the proposed versions allow for even larger stepsizes than those initially suggested in [MM20].
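
    As a rough illustration of the idea, here is a schematic NumPy sketch of adaptive gradient descent in the spirit of [MM20], with the stepsize estimated from observed gradient differences; the exact rule and constants in the paper differ (in particular, it allows larger stepsizes), so treat this only as a sketch.

        import numpy as np

        def adaptive_gd(grad, x0, steps=200, lam0=1e-7):
            # Gradient descent whose stepsize is set from a local curvature
            # estimate built from gradient differences (schematic, after [MM20]).
            x_prev = np.asarray(x0, dtype=float)
            g_prev = grad(x_prev)
            lam_prev, theta = lam0, np.inf
            x = x_prev - lam_prev * g_prev
            for _ in range(steps):
                g = grad(x)
                # Local Lipschitz (curvature) estimate from the last two iterates.
                L = (np.linalg.norm(g - g_prev) + 1e-16) / (np.linalg.norm(x - x_prev) + 1e-16)
                lam = min(np.sqrt(1.0 + theta) * lam_prev, 1.0 / (2.0 * L))
                theta = lam / lam_prev
                x_prev, g_prev, lam_prev = x, g, lam
                x = x - lam * g
            return x

        # Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
        print(adaptive_gd(lambda x: x, np.array([3.0, -4.0])))  # near zero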

    Learning-Rate-Free Learning by D-Adaptation

    The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance $D$ from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient, and requires no additional function value or gradient evaluations at each step. An open-source implementation is available at https://github.com/facebookresearch/dadaptation.
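
    Since the abstract links to the open-source implementation, here is a minimal PyTorch usage sketch assuming that package's documented drop-in interface (pip install dadaptation); the learning rate is left at 1.0 and the optimizer adapts the effective stepsize itself.

        import torch
        from dadaptation import DAdaptSGD  # the package also provides DAdaptAdam

        model = torch.nn.Linear(10, 1)
        # lr=1.0 means "no manual tuning": D-Adaptation scales it internally.
        optimizer = DAdaptSGD(model.parameters(), lr=1.0)

        X, y = torch.randn(64, 10), torch.randn(64, 1)
        for _ in range(100):
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(X), y)
            loss.backward()
            optimizer.step()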

    Prodigy: An Expeditiously Adaptive Parameter-Free Learner

    We consider the problem of estimating the learning rate in adaptive methods, such as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, to provably estimate the distance to the solution $D$, which is needed to set the learning rate optimally. Our techniques are modifications of the D-Adaptation method for learning-rate-free learning. Our methods improve upon the convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. We test our methods on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on the Criteo dataset, VarNet on the Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approaches consistently outperform D-Adaptation and reach test accuracy values close to that of hand-tuned Adam.
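
    For completeness, a similar usage sketch assuming the authors' open-source package (pip install prodigyopt), which exposes an Adam-like optimizer; as with D-Adaptation, the learning rate stays at 1.0 while the distance estimate is maintained internally.

        import torch
        from prodigyopt import Prodigy

        model = torch.nn.Linear(784, 10)
        # lr=1.0 is the recommended default; Prodigy estimates D on the fly.
        optimizer = Prodigy(model.parameters(), lr=1.0)

        X = torch.randn(32, 784)
        y = torch.randint(0, 10, (32,))
        for _ in range(50):
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(X), y)
            loss.backward()
            optimizer.step()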